Key Phrase Based - Graph Representation for Contextual Similarity Between Documents
Abstract
Finding similarity between documents that share no keywords has received little attention so far. Here we develop a graph-based representation for finding contextual similarity between documents that are completely disjoint in terms of their keywords. For this purpose, a bigram-based key phrase approach is designed. Different algorithms for pairwise similarity were studied and adapted to suit our application. A classification technique using a key phrase graph was designed to classify a document's key phrases into commonly occurring, contextually similar keywords. We give results and demonstrate the capability of our system to find contextual similarity between two docu-
Similar Resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining; it is the unsupervised classification of similar documents into different groups. The most important step...
Phrase-based Document Similarity Based on an Index Graph Model
Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than sing...
A Graphical Framework For Contextual Search And Name Disambiguation In Email
Similarity measures for text have historically been an important tool for solving information retrieval problems. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk. We provide a detailed instantiation of this framework for email data, where content, social networks and a timeline are integrated in a struct...
Random Indexing for Searching Large RDF Graphs
Querying large RDF spaces with traditional query languages such as SPARQL is challenging, as it requires familiarity with the structure of the RDF graph and the names (URIs) of its classes, properties and relevant individuals. In this paper, we propose a complementary approach based on Vector Space Models (VSM), more concretely Random Indexing (RI) [1], for building a semantic index for a large...
A Graphical Framework for Contextual Search and Disambiguation in Email
Similarity measures for text have historically been an important tool for solving information retrieval problems. In many interesting settings, however, documents are often closely connected to other documents, as well as other non-textual objects in structure-rich data. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a l...